Target Preference



Robust Preference Alignment via Directional Neighborhood Consensus

Mao, Ruochen, Shi, Yuling, Gu, Xiaodong, Wei, Jiaheng

arXiv.org Artificial Intelligence

Aligning large language models with human preferences is critical for creating reliable and controllable AI systems. A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data often reflects dominant, average preferences, LLMs tend to perform well on common requests but fall short on specific, individual needs. This mismatch creates a preference coverage gap. Existing methods often address this through costly retraining, which may not generalize to the full spectrum of diverse preferences. This brittleness means that when a user's request reflects a nuanced preference deviating from the training data's central tendency, model performance can degrade unpredictably. To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method that leverages directional neighborhood consensus. Instead of forcing a model to generate a response from a single, highly specific preference, RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. It then selects the response that best aligns with the user's original intent. We provide a theoretical framework showing that our neighborhood generation strategy is provably superior to a strong baseline that also samples multiple candidates. Comprehensive experiments across three distinct alignment paradigms (DPA, DPO, and SFT) demonstrate that RPS consistently improves robustness over this baseline, achieving win rates of up to 69% on challenging preferences from under-represented regions of the preference space without any model retraining. Our work presents a practical, theoretically grounded solution for enhancing the reliability of preference-aligned models.
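
As a rough, informal sketch of the selection procedure described in this abstract: perturb the target preference direction to form a small neighborhood, generate one candidate response per neighbor, and keep the candidate that scores best under the original target preference. The `model.generate` and `reward_model.score` interfaces below are hypothetical stand-ins, not the authors' code.

```python
import numpy as np

def robust_preference_selection(model, prompt, target_pref, reward_model,
                                n_neighbors=4, noise_scale=0.1, seed=0):
    """Sketch of the RPS idea: sample responses under a neighborhood of
    preference directions, then pick the one scoring best under the user's
    original target preference."""
    rng = np.random.default_rng(seed)
    # Build a local neighborhood of preference directions around the target.
    prefs = [target_pref]
    for _ in range(n_neighbors):
        noisy = target_pref + noise_scale * rng.normal(size=target_pref.shape)
        prefs.append(noisy / np.linalg.norm(noisy))  # keep it directional (unit norm)
    # Generate one candidate response per preference in the neighborhood.
    candidates = [model.generate(prompt, preference=p) for p in prefs]
    # Select the candidate that best aligns with the ORIGINAL target preference.
    scores = [reward_model.score(prompt, c, target_pref) for c in candidates]
    return candidates[int(np.argmax(scores))]
```

The key point of the sketch is that candidates are sampled under neighboring preferences but always scored against the user's original preference.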


An Offline Adaptation Framework for Constrained Multi-Objective Reinforcement Learning

Neural Information Processing Systems

In the standard reinforcement learning (RL) setting, the primary goal is to obtain a policy that maximizes a cumulative scalar reward [Sutton and Barto, 2018].


Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Lan, Yifan, Cao, Yuanpu, Zhang, Weitong, Lin, Lu, Chen, Jinghui

arXiv.org Artificial Intelligence

Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating the MLLM response preferences using a preference hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation -- a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.
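
For intuition only, the following generic sketch shows how a bounded image perturbation could be optimized at inference time toward an attacker-specified preference. It is a standard projected, Adam-based optimization loop under an imperceptibility budget, not the authors' released implementation; `preference_loss` is a hypothetical callable scoring how far the MLLM's response on an image deviates from the target preference.

```python
import torch

def optimize_hijack_perturbation(image, preference_loss, epsilon=8 / 255,
                                 steps=200, lr=1e-2):
    """Generic sketch: optimize a bounded additive perturbation so that the
    model's output drifts toward a target preference (lower loss = closer)."""
    delta = torch.zeros_like(image, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = preference_loss(torch.clamp(image + delta, 0.0, 1.0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Keep the perturbation within an imperceptibility budget.
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)
    return torch.clamp(image + delta, 0.0, 1.0).detach()
```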


An Offline Adaptation Framework for Constrained Multi-Objective Reinforcement Learning

Lin, Qian, Liu, Zongkai, Mo, Danying, Yu, Chao

arXiv.org Artificial Intelligence

In recent years, significant progress has been made in multi-objective reinforcement learning (RL) research, which aims to balance multiple objectives by incorporating preferences for each objective. In most existing studies, specific preferences must be provided during deployment to indicate the desired policies explicitly. However, designing these preferences depends heavily on human prior knowledge, which is typically obtained through extensive observation of high-performing demonstrations with the expected behaviors. In this work, we propose a simple yet effective offline adaptation framework for multi-objective RL problems that does not assume handcrafted target preferences, requiring only a few demonstrations that implicitly indicate the preferences of the expected policies. Additionally, we demonstrate that our framework can naturally be extended to meet constraints on safety-critical objectives by utilizing safe demonstrations, even when the safety thresholds are unknown. Empirical results on offline multi-objective and safe tasks demonstrate the capability of our framework to infer policies that align with real preferences while meeting the constraints implied by the provided demonstrations.
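
As a loose illustration of the "preferences implied by demonstrations" idea, one could estimate a preference vector from the per-objective returns of the provided demonstrations. This is a toy heuristic for intuition, not the paper's algorithm.

```python
import numpy as np

def infer_preference_from_demos(demo_returns):
    """Toy heuristic: given per-objective returns of a few demonstrations
    (shape [n_demos, n_objectives]), estimate a preference vector proportional
    to the scale-normalized average return on each objective."""
    demo_returns = np.asarray(demo_returns, dtype=float)
    avg = demo_returns.mean(axis=0)
    # Normalize each objective by its spread across demos to make scales comparable.
    scale = demo_returns.std(axis=0) + 1e-8
    weights = np.clip(avg / scale, a_min=0.0, a_max=None)
    return weights / (weights.sum() + 1e-8)  # simplex-normalized preference

# Example: two demonstrations, three objectives (e.g., speed, energy, safety).
print(infer_preference_from_demos([[10.0, 2.0, 5.0], [12.0, 1.5, 6.0]]))
```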


Policy-regularized Offline Multi-objective Reinforcement Learning

Lin, Qian, Yu, Chao, Liu, Zongkai, Wu, Zifan

arXiv.org Artificial Intelligence

In this paper, we aim to utilize only offline trajectory data to train a policy for multi-objective RL. To this end, we extend the offline policy-regularized method, a widely adopted approach for single-objective offline RL problems, to the multi-objective setting. However, such methods face a new challenge in offline MORL settings, namely the preference-inconsistent demonstration problem. We propose two solutions to this problem: 1) filtering out preference-inconsistent demonstrations via approximating behavior preferences, and 2) adopting regularization techniques with high policy expressiveness. Moreover, we integrate the preference-conditioned scalarized update method into policy-regularized offline RL in order to simultaneously learn a set of policies using a single policy network, thus reducing the computational cost of training a large number of individual policies for various preferences. Finally, we introduce Regularization Weight Adaptation to dynamically determine appropriate regularization weights for arbitrary target preferences during deployment. Empirical results on various multi-objective datasets demonstrate the capability of our approach in solving offline MORL problems.
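
For readers unfamiliar with preference-conditioned scalarized updates combined with policy regularization, the sketch below shows one plausible, TD3+BC-flavored actor loss. The `actor` and `critic` interfaces and the normalization constant are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def scalarized_regularized_actor_loss(critic, actor, obs, actions, preference,
                                      reg_weight=2.5):
    """Sketch of a preference-conditioned, policy-regularized actor update:
    scalarize multi-objective Q-values with the sampled preference, then
    regularize the policy toward the dataset actions. `critic(obs, a)` is
    assumed to return per-objective Q-values of shape [batch, n_objectives]."""
    pi = actor(obs, preference)                      # policy conditioned on the preference
    q_multi = critic(obs, pi)                        # [batch, n_objectives]
    q_scalar = (q_multi * preference).sum(dim=-1)    # linear scalarization
    bc = ((pi - actions) ** 2).mean()                # behavior-cloning regularizer
    # Normalize Q so the regularization weight keeps a consistent scale.
    lam = reg_weight / q_scalar.abs().mean().detach()
    return -(lam * q_scalar).mean() + bc
```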


Predicting A Creator's Preferences In, and From, Interactive Generative Art

Parikh, Devi

arXiv.org Artificial Intelligence

As a lay user creates an art piece using an interactive generative art tool, what, if anything, do the choices they make tell us about them and their preferences? These preferences could be in the specific generative art form (e.g., color palettes, density of the piece, thickness or curvature of any lines in the piece); predicting them could lead to a smarter interactive tool. Or they could be preferences in other walks of life (e.g., music, fashion, food, interior design, paintings) or attributes of the person (e.g., personality type, gender, artistic inclinations); predicting them could lead to improved personalized recommendations for products or experiences. To study this research question, we collect preferences from 311 subjects, both in a specific generative art form and in other walks of life. We analyze these preferences and train machine learning models to predict a subset of the preferences from the remaining ones. We find that preferences in the generative art form we studied cannot predict preferences in other walks of life better than chance (and vice versa). However, preferences within the generative art form are reliably predictive of each other.
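
The prediction setup described in this abstract can be illustrated with a small, self-contained sketch (the synthetic data and model choice are illustrative, not the paper's): treat one group of preference responses as features, a held-out preference as the label, and compare cross-validated accuracy against a majority-class chance baseline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def preference_predictability(X_prefs, y_pref, n_splits=5, seed=0):
    """Check whether one preference is predictable from the others better than
    chance, using cross-validated accuracy vs. a majority-class baseline."""
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    scores = cross_val_score(clf, X_prefs, y_pref, cv=n_splits)
    chance = np.bincount(y_pref).max() / len(y_pref)  # majority-class baseline
    return scores.mean(), chance

# Example with synthetic data: 311 subjects, 10 feature preferences, one binary label.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(311, 10))
y = rng.integers(0, 2, size=311)
print(preference_predictability(X, y))
```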